{
 "cells": [
  {
   "cell_type": "markdown",
   "id": "adf7d63d",
   "metadata": {},
   "source": [
    "<a href=\"https://colab.research.google.com/github/jerryjliu/llama_index/blob/main/docs/examples/managed/vectaraDemo.ipynb\" target=\"_parent\"><img src=\"https://colab.research.google.com/assets/colab-badge.svg\" alt=\"Open In Colab\"/></a>"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "db0855d0",
   "metadata": {},
   "source": [
    "# Managed Index with Zilliz Cloud Pipelines\n",
    "\n",
    "[Zilliz Cloud Pipelines](https://docs.zilliz.com/docs/pipelines) is a scalable API service for retrieval. You can use Zilliz Cloud Pipelines as managed index in `llama-index`. This service can transform documents into vector embeddings and store them in Zilliz Cloud for effective semantic search.\n",
    "\n",
    "## Setup\n",
    "\n",
    "1. Install llama-index"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6019e01a",
   "metadata": {},
   "outputs": [],
   "source": [
    "# ! pip install llama-index"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "3fc94b2f",
   "metadata": {},
   "source": [
    "2. Configure credentials of your [OpenAI](https://platform.openai.com) & [Zilliz Cloud](https://cloud.zilliz.com/signup?utm_source=twitter&utm_medium=social%20&utm_campaign=2023-12-22_social_pipeline-llamaindex_twitter) accounts."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "c2d1c538",
   "metadata": {},
   "outputs": [],
   "source": [
    "from getpass import getpass\n",
    "import os\n",
    "\n",
    "os.environ[\"OPENAI_API_KEY\"] = getpass(\"Enter your OpenAI API Key:\")\n",
    "\n",
    "ZILLIZ_PROJECT_ID = getpass(\"Enter your Zilliz Project ID:\")\n",
    "ZILLIZ_CLUSTER_ID = getpass(\"Enter your Zilliz Cluster ID:\")\n",
    "ZILLIZ_TOKEN = getpass(\"Enter your Zilliz API Key:\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "200b4fe0",
   "metadata": {},
   "source": [
    "> [Find your OpenAI API key](https://beta.openai.com/account/api-keys)\n",
    ">\n",
    "> [Find your Zilliz Cloud credentials](https://docs.zilliz.com/docs/on-zilliz-cloud-console)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "5d3c5b5f",
   "metadata": {},
   "source": [
    "## Indexing documents\n",
    "\n",
    "### From Signed URL\n",
    "\n",
    "Zilliz Cloud Pipelines accepts files from AWS S3 and Google Cloud Storage. You can generate a presigned url from the Object Storage and use `from_document_url()` or `insert_doc_url()` to ingest the file. It can automatically index the document and store the doc chunks as vectors on Zilliz Cloud."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "97d5c934",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No available pipelines. Please create pipelines first.\n",
      "Pipelines are automatically created.\n"
     ]
    },
    {
     "data": {
      "text/plain": [
       "{'token_usage': 984, 'doc_name': 'milvus_doc_22.md', 'num_chunks': 7}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "from llama_index.indices import ZillizCloudPipelineIndex\n",
    "\n",
    "zcp_index = ZillizCloudPipelineIndex.from_document_url(\n",
    "    # a public or pre-signed url of a file stored on AWS S3 or Google Cloud Storage\n",
    "    url=\"https://publicdataset.zillizcloud.com/milvus_doc.md\",\n",
    "    project_id=ZILLIZ_PROJECT_ID,\n",
    "    cluster_id=ZILLIZ_CLUSTER_ID,\n",
    "    token=ZILLIZ_TOKEN,\n",
    "    # optional\n",
    "    metadata={\"version\": \"2.3\"},  # used for filtering\n",
    "    collection_name=\"zcp_llamalection\",  # change this value will specify customized collection name\n",
    ")\n",
    "\n",
    "# Insert more docs, eg. a Milvus v2.2 document\n",
    "zcp_index.insert_doc_url(\n",
    "    url=\"https://publicdataset.zillizcloud.com/milvus_doc_22.md\",\n",
    "    metadata={\"version\": \"2.2\"},\n",
    ")\n",
    "\n",
    "# # Delete docs by doc name\n",
    "# zcp_index.delete_by_doc_name(doc_name=\"milvus_doc_22.md\")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d16a498e",
   "metadata": {},
   "source": [
    "> It is optional to add metadata for each document. The metadata can be used to filter doc chunks during retrieval.\n",
    "\n",
    "### From Local File\n",
    "\n",
    "Coming soon.\n",
    "\n",
    "### From Raw Text\n",
    "\n",
    "Coming soon."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "c94365ab",
   "metadata": {},
   "source": [
    "## Working as Query Engine\n",
    "\n",
    "To conduct semantic search with `ZillizCloudPipelineIndex`, you can use it `as_query_engine()` by specifying a few parameters:\n",
    "- **search_top_k**: How many text nodes/chunks to retrieve. Optional, defaults to `DEFAULT_SIMILARITY_TOP_K` (2).\n",
    "- **filters**: Metadata filters. Optional, defaults to None.\n",
    "- **output_metadata**: What metadata fields to return with the retrieved text node. Optional, defaults to []."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "dafda7a1",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters\n",
    "\n",
    "query_engine_milvus23 = zcp_index.as_query_engine(\n",
    "    search_top_k=3,\n",
    "    filters=MetadataFilters(\n",
    "        filters=[\n",
    "            ExactMatchFilter(key=\"version\", value=\"2.3\")\n",
    "        ]  # version == \"2.3\"\n",
    "    ),\n",
    "    output_metadata=[\"version\"],\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "9803232e",
   "metadata": {},
   "source": [
    "Then the query engine is ready for Semantic Search or Retrieval Augmented Generation with Milvus 2.3 documents:\n",
    "\n",
    "- **Retrieve** (Semantic search powered by Zilliz Cloud Pipelines):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "8ab92af7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "[NodeWithScore(node=TextNode(id_='447198459513870883', embedding=None, metadata={'version': '2.3'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Delete Entities\\nThis topic describes how to delete entities in Milvus.  \\nMilvus supports deleting entities by primary key or complex boolean expressions. Deleting entities by primary key is much faster and lighter than deleting them by complex boolean expressions. This is because Milvus executes queries first when deleting data by complex boolean expressions.  \\nDeleted entities can still be retrieved immediately after the deletion if the consistency level is set lower than Strong.\\nEntities deleted beyond the pre-specified span of time for Time Travel cannot be retrieved again.\\nFrequent deletion operations will impact the system performance.  \\nBefore deleting entities by comlpex boolean expressions, make sure the collection has been loaded.\\nDeleting entities by complex boolean expressions is not an atomic operation. Therefore, if it fails halfway through, some data may still be deleted.\\nDeleting entities by complex boolean expressions is supported only when the consistency is set to Bounded. For details, see Consistency.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), score=0.728226900100708), NodeWithScore(node=TextNode(id_='447198459513870886', embedding=None, metadata={'version': '2.3'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Delete Entities\\n## Prepare boolean expression\\n### Complex boolean expression\\nTo filter entities that meet specific conditions, define complex boolean expressions.  \\nFilter entities whose word_count is greater than or equal to 11000:  \\n```python\\nexpr = \"word_count >= 11000\"\\n```  \\nFilter entities whose book_name is not Unknown:  \\n```python\\nexpr = \"book_name != Unknown\"\\n```  \\nFilter entities whose primary key values are greater than 5 and word_count is smaller than or equal to 9999:  \\n```python\\nexpr = \"book_id > 5 && word_count <= 9999\"\\n```', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), score=0.687866747379303), NodeWithScore(node=TextNode(id_='447198459513870884', embedding=None, metadata={'version': '2.3'}, excluded_embed_metadata_keys=[], excluded_llm_metadata_keys=[], relationships={}, text='# Delete Entities\\n## Prepare boolean expression\\nPrepare the boolean expression that filters the entities to delete.  \\nMilvus supports deleting entities by primary key or complex boolean expressions. For more information on expression rules and supported operators, see Boolean Expression Rules.', start_char_idx=None, end_char_idx=None, text_template='{metadata_str}\\n\\n{content}', metadata_template='{key}: {value}', metadata_seperator='\\n'), score=0.6814976334571838)]\n"
     ]
    }
   ],
   "source": [
    "question = \"Can users delete entities by filtering non-primary fields?\"\n",
    "retrieved_nodes = query_engine_milvus23.retrieve(question)\n",
    "print(retrieved_nodes)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "d47d322d",
   "metadata": {},
   "source": [
    "> The query engine with filters retrieves only text nodes with \\\"version 2.3\\\" tag."
   ]
  },
  {
   "cell_type": "markdown",
   "id": "e91c5896",
   "metadata": {},
   "source": [
    "- **Query** (RAG powered by Zilliz Cloud Pipelines as retriever and OpenAI's LLM):"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "fc7b01b7",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Yes, users can delete entities by filtering non-primary fields using complex boolean expressions in Milvus. The complex boolean expressions allow users to define specific conditions to filter entities based on non-primary fields, such as word_count or book_name. By specifying the desired conditions in the boolean expression, users can delete entities that meet those conditions. However, it is important to note that deleting entities by complex boolean expressions is not an atomic operation, and if it fails halfway through, some data may still be deleted.\n"
     ]
    }
   ],
   "source": [
    "response = query_engine_milvus23.query(question)\n",
    "print(response.response)"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "4c3c239b",
   "metadata": {},
   "source": [
    "## Advanced\n",
    "\n",
    "You are able to get the managed index without running data ingestion. In order to get ready with Zilliz Cloud Pipelines, you need to provide either pipeline ids or collection name:\n",
    "\n",
    "- pipeline_ids: The dictionary of pipeline ids for INGESTION, SEARCH, DELETION. Defaults to None. For example: {\"INGESTION\": \"pipe-xx1\", \"SEARCH\": \"pipe-xx2\", \"DELETION\": \"pipe-xx3\"}.\n",
    "- collection_name: The collection name, defaults to 'zcp_llamalection'. If no pipeline_ids is given, the index will try to get pipelines with collection_name."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "857746c4",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "No available pipelines. Please create pipelines first.\n"
     ]
    }
   ],
   "source": [
    "from llama_index.indices import ZillizCloudPipelineIndex\n",
    "\n",
    "\n",
    "advanced_zcp_index = ZillizCloudPipelineIndex(\n",
    "    project_id=ZILLIZ_PROJECT_ID,\n",
    "    cluster_id=ZILLIZ_CLUSTER_ID,\n",
    "    token=ZILLIZ_TOKEN,\n",
    "    collection_name=\"zcp_llamalection_advanced\",\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "91f99675",
   "metadata": {},
   "source": [
    "### Customize Pipelines\n",
    "\n",
    "If no pipelines are provided or found, then you can manually create and customize pipelines with the following **optional** parameters:\n",
    "\n",
    "- **metadata_schema**: A dictionary of metadata schema with field name as key and data type as value. For example, {\"user_id\": \"VarChar\"}.\n",
    "- **chunkSize**: An integer of chunk size using token as unit. If no chunk size is specified, then Zilliz Cloud Pipeline will use a built-in default chunk size (500 tokens) to split documents.\n",
    "- **(others)**: Refer to [Zilliz Cloud Pipelines](https://docs.zilliz.com/docs/pipelines) for more available pipeline parameters."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "51079a30",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'INGESTION': 'pipe-220572b2597efba9a91ed5',\n",
       " 'SEARCH': 'pipe-8de59599229631c72d4d2c',\n",
       " 'DELETION': 'pipe-2813fbf9eb09b352e81efa'}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "advanced_zcp_index.create_pipelines(\n",
    "    metadata_schema={\"user_id\": \"VarChar\"},\n",
    "    chunkSize=350,\n",
    "    # other pipeline params\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "a6efa3b9",
   "metadata": {},
   "source": [
    "### Multi-Tenancy\n",
    "\n",
    "With the tenant-specific value (eg. user id) as metadata, the managed index is able to achieve multi-tenancy by applying metadata filters.\n",
    "\n",
    "By specifying metadata value, each document is tagged with the tenant-specific field at ingestion."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "5ddd53a6",
   "metadata": {},
   "outputs": [
    {
     "data": {
      "text/plain": [
       "{'token_usage': 1247, 'doc_name': 'milvus_doc.md', 'num_chunks': 10}"
      ]
     },
     "execution_count": null,
     "metadata": {},
     "output_type": "execute_result"
    }
   ],
   "source": [
    "advanced_zcp_index.insert_doc_url(\n",
    "    url=\"https://publicdataset.zillizcloud.com/milvus_doc.md\",\n",
    "    metadata={\"user_id\": \"user_001\"},\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "62a88e50",
   "metadata": {},
   "source": [
    "Then the managed index is able to build a query engine for each tenant by filtering the tenant-specific field."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "6e684e22",
   "metadata": {},
   "outputs": [],
   "source": [
    "from llama_index.vector_stores.types import ExactMatchFilter, MetadataFilters\n",
    "\n",
    "query_engine_for_user_001 = advanced_zcp_index.as_query_engine(\n",
    "    search_top_k=3,\n",
    "    filters=MetadataFilters(\n",
    "        filters=[ExactMatchFilter(key=\"user_id\", value=\"user_001\")]\n",
    "    ),\n",
    "    output_metadata=[\"user_id\"],  # optional, display user_id in outputs\n",
    ")"
   ]
  },
  {
   "cell_type": "markdown",
   "id": "eae76dea",
   "metadata": {},
   "source": [
    "> Change `filters` to build query engines with different conditions."
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "id": "ea6899f9",
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Yes, you can delete entities by filtering non-primary fields. Milvus supports deleting entities by complex boolean expressions, which allows you to filter entities based on specific conditions on non-primary fields. You can define complex boolean expressions using operators such as greater than or equal to, not equal to, and logical operators like AND and OR. By using these expressions, you can filter entities based on the values of non-primary fields and delete them accordingly.\n"
     ]
    }
   ],
   "source": [
    "question = \"Can I delete entities by filtering non-primary fields?\"\n",
    "\n",
    "# search_results = query_engine_for_user_001.retrieve(question)\n",
    "response = query_engine_for_user_001.query(question)\n",
    "print(response.response)"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 5
}
